Source code: https://github.com/djlofland/DS621_F2020_Group3/tree/master/Homework_4
Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.
Because the Index column has no relationship to the target variable, it was dropped as part of the initial cleaning function. Additionally, the fields “INCOME”, “HOME_VAL”, “OLDCLAIM”, and “BLUEBOOK” were imported as character strings with leading “$” symbols and were converted to numeric in the same cleaning function. Both the training and evaluation datasets pass through this treatment.
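The cleaning step described above can be sketched in a few lines. This is a minimal Python illustration, not the authors' R cleaning function: the column name `INDEX`, the presence of comma thousands separators, and the example row are all assumptions made for the sketch.

```python
from typing import Optional


def parse_currency(value: Optional[str]) -> Optional[float]:
    """Convert a string such as "$67,349" to 67349.0; blanks become None."""
    if value is None:
        return None
    # Assumption: values may also carry comma thousands separators.
    cleaned = value.replace("$", "").replace(",", "").strip()
    if cleaned == "":
        return None
    return float(cleaned)


def clean_row(row: dict) -> dict:
    """Drop the index column and convert the four currency fields to numbers."""
    currency_fields = ("INCOME", "HOME_VAL", "OLDCLAIM", "BLUEBOOK")
    # Assumption: the index column is named "INDEX".
    cleaned = {key: val for key, val in row.items() if key != "INDEX"}
    for field in currency_fields:
        if field in cleaned:
            cleaned[field] = parse_currency(cleaned[field])
    return cleaned


# Hypothetical row for illustration only; not taken from the data set.
example = {
    "INDEX": "1",
    "INCOME": "$67,349",
    "HOME_VAL": "$0",
    "OLDCLAIM": "",
    "BLUEBOOK": "$14,230",
}
print(clean_row(example))
```

Applying the same function to both the training and evaluation rows keeps the two data sets on identical footing before modeling.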
We compiled summary statistics on our data set to better understand the data before modeling.
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| TARGET_FLAG | 0.0000 | 0.0000 | 0.0000 | 0.2638 | 1.0000 | 1.0000 | 0 |
| TARGET_AMT | 0 | 0 | 0 | 1504 | 1036 | 107586 | 0 |
| KIDSDRIV | 0.0000 | 0.0000 | 0.0000 | 0.1711 | 0.0000 | 4.0000 | 0 |
| AGE | 16.00 | 39.00 | 45.00 | 44.79 | 51.00 | 81.00 | 6 |
| HOMEKIDS | 0.0000 | 0.0000 | 0.0000 | 0.7212 | 1.0000 | 5.0000 | 0 |
| YOJ | 0.0 | 9.0 | 11.0 | 10.5 | 13.0 | 23.0 | 454 |
| INCOME | 0 | 28097 | 54028 | 61898 | 85986 | 367030 | 445 |
| HOME_VAL | 0 | 0 | 161160 | 154867 | 238724 | 885282 | 464 |
| TRAVTIME | 5.00 | 22.00 | 33.00 | 33.49 | 44.00 | 142.00 | 0 |
| BLUEBOOK | 1500 | 9280 | 14440 | 15710 | 20850 | 69740 | 0 |
| TIF | 1.000 | 1.000 | 4.000 | 5.351 | 7.000 | 25.000 | 0 |
| OLDCLAIM | 0 | 0 | 0 | 4037 | 4636 | 57037 | 0 |
| CLM_FREQ | 0.0000 | 0.0000 | 0.0000 | 0.7986 | 2.0000 | 5.0000 | 0 |
| MVR_PTS | 0.000 | 0.000 | 1.000 | 1.696 | 3.000 | 13.000 | 0 |
| CAR_AGE | -3.000 | 1.000 | 8.000 | 8.328 | 12.000 | 28.000 | 510 |

The remaining variables are stored as character (length 8161): PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, and URBANICITY.
Next, we wanted to get an idea of the distribution profiles for each of the variables. The binary target, TARGET_FLAG, takes two values, 0 and 1. When building models, we ideally want roughly equal representation of both classes. As class imbalance grows, model performance suffers both from differing variance between the classes and from bias towards picking the more represented class. For logistic regression, if we see a strong imbalance, we can 1) up-sample the smaller group, 2) down-sample the larger group, or 3) move the threshold for assigning the predicted class away from 0.5.
| TARGET_FLAG | Proportion |
|---|---|
| 0 | 0.7361843 |
| 1 | 0.2638157 |
The classes are not balanced, with approximately 73.6% 0’s and 26.4% 1’s. With unbalanced class distributions, it is often necessary to artificially balance the classes to achieve good results. Up-sampling or down-sampling may be required to achieve class balance with this dataset. We will evaluate model performance accordingly.
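As a rough illustration of the up-sampling option mentioned above, the sketch below resamples the minority class with replacement until both classes are the same size. It is a Python sketch on toy data with the same 74/26 split, not the procedure used in the analysis; the helper names `class_proportions` and `upsample_minority` are hypothetical.

```python
import random
from collections import Counter


def class_proportions(labels):
    """Return the share of each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def upsample_minority(rows, label_key, seed=0):
    """Resample smaller classes with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the shortfall at random, with replacement, from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy data only: 74 zeros and 26 ones, mimicking the observed imbalance.
toy = [{"TARGET_FLAG": 0} for _ in range(74)] + [{"TARGET_FLAG": 1} for _ in range(26)]
balanced = upsample_minority(toy, "TARGET_FLAG")

print(class_proportions([r["TARGET_FLAG"] for r in toy]))
print(class_proportions([r["TARGET_FLAG"] for r in balanced]))
```

If resampling is used, it should be applied only to the training rows; the evaluation data keeps its natural class mix so that reported performance reflects real conditions.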
Next, we visualize the distribution profiles for each of the predictor variables. This will help us decide which variables to include, see how they might be related to each other or to the target, and identify outliers or transformations that might improve model resolution.
The distribution profiles show that skewness is prevalent: TRAVTIME, OLDCLAIM, MVR_PTS, TARGET_AMT, INCOME, and BLUEBOOK are right-skewed, while YOJ, CAR_AGE, HOME_VAL, and AGE are approximately normal. Strongly skewed variables can be problematic for regression assumptions, so we might need to transform the data. For logistic regression, we will also need to dummy-encode the factor-based variables so that the model can use them.
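To make the dummy-encoding step concrete, here is a minimal Python sketch that turns one categorical column into 0/1 indicator columns, leaving out one reference level. The helper `dummy_encode` and the example values are hypothetical; in R, the modeling functions typically handle this step for factor columns.

```python
def dummy_encode(values, prefix):
    """Create one 0/1 indicator column per level, omitting a reference level.

    The alphabetically first level is used as the reference, so a column with
    k levels yields k - 1 indicator columns.
    """
    levels = sorted(set(values))
    reference, others = levels[0], levels[1:]
    columns = {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in others
    }
    return reference, columns


# Illustrative values only; the levels are assumed for the example.
car_use = ["Private", "Commercial", "Private", "Private"]
reference, columns = dummy_encode(car_use, "CAR_USE")
print(reference)
print(columns)
```

Dropping one level avoids perfect collinearity with the intercept; each remaining coefficient is then read relative to the reference level.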
While we don’t tackle feature engineering in this analysis, a more in-depth analysis could leverage the mixtools package (see R Vignette). This package fits mixture models, in which the data can be subdivided into subgroups.
Lastly, several features combine a continuous distribution with a large number of values at one extreme. However, based on the feature meanings and provided information, there is no reason to believe that any of these extreme values are mistakes, data errors, or otherwise inexplicable. As such, we will not remove the extreme values, as they represent valuable data and could be predictive of the target.
In addition to creating histograms, we also used boxplots to get an idea of the spread of the response variable TARGET_AMT in relation to each of the non-numeric variables. Two sets of boxplots are shown below due to the wide distribution of the response variable. The first set of boxplots covers the entire range and shows how high the cost of car crashes reaches within each category.
The second set of boxplots shows the same distributions “zoomed in” by adjusting the axis so that the interquartile range of the response variable is visible for each of the categorical predictors.
We also plotted each variable against the target variable to get an idea of the relationship between them. There are some notable trends in the scatterplots below: our response variable TARGET_AMT is likely to be lower when individuals have more kids at home, as indicated by the HOMEKIDS feature, and when they have more teenagers driving the car, as indicated by the KIDSDRIV feature.
Additionally, a pairwise comparison plot between all features, both numeric and non-numeric, follows the scatterplots. It initially suggests that few features are strongly correlated with one another, which gives some insight into the expected significance of each predictor and into whether dimensionality reduction is worthwhile for the models.
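A numeric counterpart to the pairwise plot is to compute Pearson correlations between numeric columns and list the pairs that exceed some cutoff. The Python sketch below does this on toy columns; the cutoff of 0.8, the column names, and the helper names are assumptions for illustration, not values taken from the analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Toy columns: B is an exact multiple of A, while C is only weakly related.
toy_columns = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 1, 4, 2, 3],
}
print(correlated_pairs(toy_columns))
```

Pairs flagged this way are candidates for dropping one member or for combining the two before fitting; an empty result supports keeping all numeric features.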
Finally, we can observe the sparsity of information within our dataset by using the DataExplorer package to assess missing information.
We can see that, generally, our dataset is in good shape; however, some imputation may be needed for INCOME, YOJ, HOME_VAL, and CAR_AGE.
To summarize our data preparation and exploration, we group our findings into a few categories below:
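Missing values are confined to a few predictors (notably INCOME, YOJ, HOME_VAL, and CAR_AGE); the remaining fields are complete. The one value that looks implausible is the minimum CAR_AGE of -3 reported in the summary statistics. We have kept all the fields.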
Missing values will be imputed with the `step_impute_*` functions from the tidymodels `recipes` package.
No outliers were removed as all values seemed reasonable.
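The imputation mentioned above can be illustrated with a simple median fill. The source does not say which `step_impute_*` variant is used, so median imputation is an assumption here, and this Python sketch with hypothetical values only mirrors the idea of learning the fill value from the training data and reusing it on the evaluation data.

```python
from statistics import median


def fit_median(values):
    """Learn the fill value from the observed (non-missing) training values."""
    observed = [v for v in values if v is not None]
    return median(observed)


def impute(values, fill):
    """Replace missing entries (None) with the previously learned fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical CAR_AGE values for illustration; None marks a missing entry.
train_car_age = [1, 8, None, 12, 3]
eval_car_age = [None, 10]

fill_value = fit_median(train_car_age)
print(impute(train_car_age, fill_value))
print(impute(eval_car_age, fill_value))
```

Learning the fill value on the training data alone avoids leaking information from the evaluation set into the model.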
Finally, as our histogram plots showed during data exploration, some of our variables are highly skewed. To address this, we applied transformations to make them more normally distributed. The plots below show the distributions before and after the transformations:
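The effect of such a transformation can also be checked numerically by comparing skewness before and after. The Python sketch below uses `log(1 + x)` on made-up right-skewed values; the source does not state which transformations were applied, so the choice of `log1p` (convenient because it accepts zeros) is an assumption.

```python
import math


def skewness(values):
    """Moment-based sample skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up right-skewed values with several zeros; not taken from the data set.
raw = [0, 0, 0, 500, 1200, 3000, 8000, 57000]

# log1p(x) = log(1 + x), so zero values map to zero instead of failing.
logged = [math.log1p(v) for v in raw]

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

A skewness closer to zero after the transformation indicates a more symmetric distribution, which is what the before-and-after histograms are meant to show visually.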
Using the training data, build at least two different multiple linear regression models and three different binary logistic models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done. Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter-intuitive? Why? The boss needs to know.
First model: no transformations, all nominal predictors included.